Hotel Booking Demand: Exploratory Data Analysis and Cancellation Prediction

Project Overview

Group Members:

Ait ADDI Abdelghafour, Chouki Amina, Matallah Mohamed Walid

We have all booked a hotel room and cancelled the reservation afterwards. We therefore decided to analyse a dataset from the Booking site and to try to predict booking cancellations. Web scraping such a site is sensitive and requires prior permission, which is why we preferred to work with a ready-made dataset.

IMPORTING THE LIBRARIES:

We start by installing and importing the libraries we need.
In [1]:
pip install folium 
Requirement already satisfied: folium in c:\users\user\anaconda3\lib\site-packages (0.14.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\user\anaconda3\lib\site-packages (from folium) (0.6.0)
Requirement already satisfied: jinja2>=2.9 in c:\users\user\anaconda3\lib\site-packages (from folium) (2.10.3)
Requirement already satisfied: numpy in c:\users\user\anaconda3\lib\site-packages (from folium) (1.16.5)
Requirement already satisfied: requests in c:\users\user\anaconda3\lib\site-packages (from folium) (2.22.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\user\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (1.1.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\user\anaconda3\lib\site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests->folium) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests->folium) (1.24.2)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests->folium) (2019.9.11)
Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install plotly.express
Requirement already satisfied: plotly.express in c:\users\user\anaconda3\lib\site-packages (0.4.1)
Requirement already satisfied: pandas>=0.20.0 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (0.25.1)
Requirement already satisfied: plotly>=4.1.0 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (5.14.1)
Requirement already satisfied: statsmodels>=0.9.0 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (0.10.1)
Requirement already satisfied: scipy>=0.18 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (1.3.1)
Requirement already satisfied: patsy>=0.5 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (0.5.1)
Requirement already satisfied: numpy>=1.11 in c:\users\user\anaconda3\lib\site-packages (from plotly.express) (1.16.5)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\user\anaconda3\lib\site-packages (from pandas>=0.20.0->plotly.express) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in c:\users\user\anaconda3\lib\site-packages (from pandas>=0.20.0->plotly.express) (2019.3)
Requirement already satisfied: six in c:\users\user\anaconda3\lib\site-packages (from patsy>=0.5->plotly.express) (1.12.0)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\user\anaconda3\lib\site-packages (from plotly>=4.1.0->plotly.express) (8.2.2)
Requirement already satisfied: packaging in c:\users\user\anaconda3\lib\site-packages (from plotly>=4.1.0->plotly.express) (19.2)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\user\anaconda3\lib\site-packages (from packaging->plotly>=4.1.0->plotly.express) (2.4.2)
Note: you may need to restart the kernel to use updated packages.
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import scipy as sp
import warnings
import datetime


warnings.filterwarnings("ignore")
%matplotlib inline

LOADING THE DATASET:

In [4]:
def load_data(filepath):
    return pd.read_csv(filepath)

data = load_data('hotel_bookings.csv')
In [5]:
data.head()
Out[5]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

We print how many times each unique value occurs in every column of the DataFrame.

In [6]:
def count_values_in_each_column(df):
    for column in df.columns:
        print(f"Column: {column}")
        print(df[column].value_counts())
        print()

count_values_in_each_column(data)
Column: hotel
City Hotel      79330
Resort Hotel    40060
Name: hotel, dtype: int64

Column: is_canceled
0    75166
1    44224
Name: is_canceled, dtype: int64

Column: lead_time
0      6345
1      3460
2      2069
3      1816
4      1715
       ... 
458       1
371       1
737       1
435       1
387       1
Name: lead_time, Length: 479, dtype: int64

Column: arrival_date_year
2016    56707
2017    40687
2015    21996
Name: arrival_date_year, dtype: int64

Column: arrival_date_month
August       13877
July         12661
May          11791
October      11160
April        11089
June         10939
September    10508
March         9794
February      8068
November      6794
December      6780
January       5929
Name: arrival_date_month, dtype: int64

Column: arrival_date_week_number
33    3580
30    3087
32    3045
34    3040
18    2926
21    2854
28    2853
17    2805
20    2785
29    2763
42    2756
31    2741
41    2699
15    2689
27    2664
25    2663
38    2661
23    2621
35    2593
39    2581
22    2546
24    2498
13    2416
16    2405
19    2402
40    2397
26    2391
43    2352
44    2272
14    2264
37    2229
8     2216
36    2167
10    2149
9     2117
7     2109
12    2083
11    2070
45    1941
53    1816
49    1782
47    1685
46    1574
6     1508
50    1505
48    1504
4     1487
5     1387
3     1319
2     1218
52    1195
1     1047
51     933
Name: arrival_date_week_number, dtype: int64

Column: arrival_date_day_of_month
17    4406
5     4317
15    4196
25    4160
26    4147
9     4096
12    4087
16    4078
2     4055
19    4052
20    4032
18    4002
24    3993
28    3946
8     3921
3     3855
30    3853
6     3833
14    3819
27    3802
21    3767
4     3763
13    3745
7     3665
1     3626
23    3616
11    3599
22    3596
29    3580
10    3575
31    2208
Name: arrival_date_day_of_month, dtype: int64

Column: stays_in_weekend_nights
0     51998
2     33308
1     30626
4      1855
3      1259
6       153
5        79
8        60
7        19
9        11
10        7
12        5
13        3
16        3
14        2
18        1
19        1
Name: stays_in_weekend_nights, dtype: int64

Column: stays_in_week_nights
2     33684
1     30310
3     22258
5     11077
4      9563
0      7645
6      1499
10     1036
7      1029
8       656
9       231
15       85
11       56
19       44
12       42
20       41
14       35
13       27
16       16
21       15
22        7
18        6
25        6
30        5
17        4
24        3
40        2
42        1
26        1
32        1
33        1
34        1
35        1
41        1
50        1
Name: stays_in_week_nights, dtype: int64

Column: adults
2     89680
1     23027
3      6202
0       403
4        62
26        5
27        2
20        2
5         2
55        1
50        1
40        1
10        1
6         1
Name: adults, dtype: int64

Column: children
0.0     110796
1.0       4861
2.0       3652
3.0         76
10.0         1
Name: children, dtype: int64

Column: babies
0     118473
1        900
2         15
10         1
9          1
Name: babies, dtype: int64

Column: meal
BB           92310
HB           14463
SC           10650
Undefined     1169
FB             798
Name: meal, dtype: int64

Column: country
PRT    48590
GBR    12129
FRA    10415
ESP     8568
DEU     7287
       ...  
SMR        1
PYF        1
MLI        1
NAM        1
MMR        1
Name: country, Length: 177, dtype: int64

Column: market_segment
Online TA        56477
Offline TA/TO    24219
Groups           19811
Direct           12606
Corporate         5295
Complementary      743
Aviation           237
Undefined            2
Name: market_segment, dtype: int64

Column: distribution_channel
TA/TO        97870
Direct       14645
Corporate     6677
GDS            193
Undefined        5
Name: distribution_channel, dtype: int64

Column: is_repeated_guest
0    115580
1      3810
Name: is_repeated_guest, dtype: int64

Column: previous_cancellations
0     112906
1       6051
2        116
3         65
24        48
11        35
4         31
26        26
25        25
6         22
19        19
5         19
14        14
13        12
21         1
Name: previous_cancellations, dtype: int64

Column: previous_bookings_not_canceled
0     115770
1       1542
2        580
3        333
4        229
       ...  
47         1
36         1
49         1
50         1
63         1
Name: previous_bookings_not_canceled, Length: 73, dtype: int64

Column: reserved_room_type
A    85994
D    19201
E     6535
F     2897
G     2094
B     1118
C      932
H      601
P       12
L        6
Name: reserved_room_type, dtype: int64

Column: assigned_room_type
A    74053
D    25322
E     7806
F     3751
G     2553
C     2375
B     2163
H      712
I      363
K      279
P       12
L        1
Name: assigned_room_type, dtype: int64

Column: booking_changes
0     101314
1      12701
2       3805
3        927
4        376
5        118
6         63
7         31
8         17
9          8
10         6
13         5
14         5
15         3
11         2
12         2
16         2
17         2
20         1
18         1
21         1
Name: booking_changes, dtype: int64

Column: deposit_type
No Deposit    104641
Non Refund     14587
Refundable       162
Name: deposit_type, dtype: int64

Column: agent
9.0      31961
240.0    13922
1.0       7191
14.0      3640
7.0       3539
         ...  
213.0        1
433.0        1
197.0        1
367.0        1
337.0        1
Name: agent, Length: 333, dtype: int64

Column: company
40.0     927
223.0    784
67.0     267
45.0     250
153.0    215
        ... 
229.0      1
213.0      1
416.0      1
320.0      1
461.0      1
Name: company, Length: 352, dtype: int64

Column: days_in_waiting_list
0      115692
39        227
58        164
44        141
31        127
        ...  
175         1
117         1
89          1
92          1
183         1
Name: days_in_waiting_list, Length: 128, dtype: int64

Column: customer_type
Transient          89613
Transient-Party    25124
Contract            4076
Group                577
Name: customer_type, dtype: int64

Column: adr
62.00     3754
75.00     2715
90.00     2473
65.00     2418
0.00      1959
          ... 
202.74       1
87.64        1
69.83        1
160.83       1
35.64        1
Name: adr, Length: 8879, dtype: int64

Column: required_car_parking_spaces
0    111974
1      7383
2        28
3         3
8         2
Name: required_car_parking_spaces, dtype: int64

Column: total_of_special_requests
0    70318
1    33226
2    12969
3     2497
4      340
5       40
Name: total_of_special_requests, dtype: int64

Column: reservation_status
Check-Out    75166
Canceled     43017
No-Show       1207
Name: reservation_status, dtype: int64

Column: reservation_status_date
2015-10-21    1461
2015-07-06     805
2016-11-25     790
2015-01-01     763
2016-01-18     625
              ... 
2015-02-24       1
2015-04-07       1
2015-04-25       1
2015-03-13       1
2015-03-29       1
Name: reservation_status_date, Length: 926, dtype: int64
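
Printing the full value counts of every column produces a long listing. As a complement, the sketch below builds a one-row-per-column summary (data type, number of distinct values, most frequent value). The helper name summarize_columns and the toy DataFrame are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, distinct values, most frequent value and its count."""
    rows = []
    for column in df.columns:
        counts = df[column].value_counts()  # sorted by frequency, NaN excluded
        rows.append({
            "column": column,
            "dtype": df[column].dtype.name,
            "n_unique": int(df[column].nunique()),
            "top_value": counts.index[0] if len(counts) else None,
            "top_count": int(counts.iloc[0]) if len(counts) else 0,
        })
    return pd.DataFrame(rows).set_index("column")

# Toy data standing in for hotel_bookings.csv (values are made up)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [0, 1, 0, 0],
    "lead_time": [7, 13, 7, 342],
})

summary = summarize_columns(toy)
print(summary)
```

Calling summarize_columns(data) on the real DataFrame would give the same kind of table for its 32 columns.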

We print the names of the columns in the DataFrame.

In [7]:
def get_column_names(df):
    return df.columns.tolist()

columns = get_column_names(data)
print(columns)
['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

We print the data type of each column.

In [8]:
def get_data_types(df):
    # Series.items() replaces the deprecated iteritems(), which was removed in pandas 2.0
    return {column: dtype.name for column, dtype in df.dtypes.items()}

data_types = get_data_types(data)
print(data_types)
{'hotel': 'object', 'is_canceled': 'int64', 'lead_time': 'int64', 'arrival_date_year': 'int64', 'arrival_date_month': 'object', 'arrival_date_week_number': 'int64', 'arrival_date_day_of_month': 'int64', 'stays_in_weekend_nights': 'int64', 'stays_in_week_nights': 'int64', 'adults': 'int64', 'children': 'float64', 'babies': 'int64', 'meal': 'object', 'country': 'object', 'market_segment': 'object', 'distribution_channel': 'object', 'is_repeated_guest': 'int64', 'previous_cancellations': 'int64', 'previous_bookings_not_canceled': 'int64', 'reserved_room_type': 'object', 'assigned_room_type': 'object', 'booking_changes': 'int64', 'deposit_type': 'object', 'agent': 'float64', 'company': 'float64', 'days_in_waiting_list': 'int64', 'customer_type': 'object', 'adr': 'float64', 'required_car_parking_spaces': 'int64', 'total_of_special_requests': 'int64', 'reservation_status': 'object', 'reservation_status_date': 'object'}

We check whether our DataFrame contains missing values.

In [9]:
data.isnull().any()
Out[9]:
hotel                             False
is_canceled                       False
lead_time                         False
arrival_date_year                 False
arrival_date_month                False
arrival_date_week_number          False
arrival_date_day_of_month         False
stays_in_weekend_nights           False
stays_in_week_nights              False
adults                            False
children                           True
babies                            False
meal                              False
country                            True
market_segment                    False
distribution_channel              False
is_repeated_guest                 False
previous_cancellations            False
previous_bookings_not_canceled    False
reserved_room_type                False
assigned_room_type                False
booking_changes                   False
deposit_type                      False
agent                              True
company                            True
days_in_waiting_list              False
customer_type                     False
adr                               False
required_car_parking_spaces       False
total_of_special_requests         False
reservation_status                False
reservation_status_date           False
dtype: bool
In [10]:
data.isnull().sum()
Out[10]:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

The DataFrame does contain missing values, in the children, country, agent and company columns. We replace them with 0 in the code below.

In [11]:
data.fillna(0, inplace = True)
In [12]:
import seaborn as sns

plt.figure(figsize=(12,8))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.show()

The heatmap above shows that no missing values remain.

In [13]:
data.isnull().sum()
Out[13]:
hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
agent                             0
company                           0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
reservation_status                0
reservation_status_date           0
dtype: int64
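
Filling every column with 0 is simple, but it also puts 0 into text columns such as country. A possible alternative, sketched here on a toy DataFrame, is to measure the share of missing values per column and to choose a fill value per column. The fill values below (including the "Unknown" label) are assumptions for illustration, not the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with the same kinds of gaps as the real dataset (values are made up)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, np.nan, 240.0, np.nan],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = toy.isnull().mean()
print(missing_share)

# Hypothetical per-column fill values: 0 for numeric ids and counts, a label for text
fill_values = {"children": 0, "country": "Unknown", "agent": 0}
filled = toy.fillna(value=fill_values)
print(filled)
```

With a dictionary passed to fillna, each column receives its own replacement value and columns that are not listed are left untouched.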

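We create a new DataFrame, pays_invités_nannulé, that counts, for each country, the guests (one row per booking) who did not cancel their reservation. The row whose country is 0 groups the bookings with a missing country, which was filled with 0 above.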

In [14]:
pays_invités_nannulé= data[data['is_canceled'] == 0].groupby('country').size().reset_index(name='No of guests')
pays_invités_nannulé
Out[14]:
country No of guests
0 0 421
1 ABW 2
2 AGO 157
3 AIA 1
4 ALB 10
... ... ...
161 VEN 14
162 VNM 6
163 ZAF 49
164 ZMB 1
165 ZWE 2

166 rows × 2 columns

In [15]:
import folium
from folium.plugins import HeatMap
import plotly.express as px
In [17]:
basemap = folium.Map()
guests_map = px.choropleth(pays_invités_nannulé, locations = pays_invités_nannulé['country'],
                           color = pays_invités_nannulé['No of guests'], hover_name = pays_invités_nannulé['country'])
guests_map.show()
People from all over the world stay in these two hotels. Most guests come from Portugal and other European countries.

What is the price of a room per night?

In [18]:
import plotly.express as px

def plot_box(df, filter_col, filter_val, x, y, color):
    filtered_df = df[df[filter_col] == filter_val]
    fig = px.box(data_frame = filtered_df, x = x, y = y, color = color)
    fig.show()

plot_box(data, 'is_canceled', 0, 'reserved_room_type', 'adr', 'hotel')

The box plot shows that the price per night (adr) depends on the room type, and that the spread of prices also varies from one room type to another.
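
To put numbers on what the box plot displays, the mean, median and standard deviation of adr can be computed per room type. This is a minimal sketch on made-up prices; on the real data the same aggregation would be applied to the non-cancelled rows of data.

```python
import pandas as pd

# Toy prices per room type (values are made up)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "A", "D", "D", "D"],
    "adr": [60.0, 80.0, 100.0, 90.0, 120.0, 150.0],
})

# Central tendency and spread of the nightly price for each room type
price_stats = (toy.groupby("reserved_room_type")["adr"]
                  .agg(["mean", "median", "std", "count"]))
print(price_stats)
```

Adding 'hotel' to the groupby keys would split the statistics by hotel, as the colours do in the plot.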

We create two new DataFrames, data_resort and data_city, which hold the non-cancelled bookings for the "Resort Hotel" and the "City Hotel" respectively. We then count the guests who arrived each month at the Resort Hotel and store the result in a new DataFrame, resort_invité.

In [19]:
data_resort = data.query("`hotel` == 'Resort Hotel' and `is_canceled` == 0")
data_city = data.query("`hotel` == 'City Hotel' and `is_canceled` == 0")

# Count the guests per month for the "Resort Hotel" and store the result in a new DataFrame.
resort_invité = pd.DataFrame(data_resort.groupby('arrival_date_month').size(), columns=['no of guests'])
resort_invité.reset_index(level=0, inplace=True)
resort_invité.rename(columns={'arrival_date_month':'month'}, inplace=True)

resort_invité
Out[19]:
month no of guests
0 April 2550
1 August 3257
2 December 2017
3 February 2308
4 January 1868
5 July 3137
6 June 2038
7 March 2573
8 May 2535
9 November 1976
10 October 2577
11 September 2102

We count the guests who arrived each month at the City Hotel and store the result in a new DataFrame, city_visiteur.

In [20]:
# Count the guests per month for the "City Hotel" and store the result in a new DataFrame.
city_visiteur = pd.DataFrame(data_city.groupby('arrival_date_month').size(), columns=['no of guests'])
city_visiteur.reset_index(level=0, inplace=True)
city_visiteur.rename(columns={'arrival_date_month':'month'}, inplace=True)

city_visiteur
Out[20]:
month no of guests
0 April 4015
1 August 5381
2 December 2392
3 February 3064
4 January 2254
5 July 4782
6 June 4366
7 March 4072
8 May 4579
9 November 2696
10 October 4337
11 September 4290

We create a new DataFrame, total_visiteurs, by combining the resort_invité and city_visiteur DataFrames on the 'month' column.

In [23]:
# Combine the two DataFrames on the month and rename the columns
total_visiteurs = pd.concat([resort_invité.set_index('month'), city_visiteur.set_index('month')], axis=1, keys=['Resort', 'City'])
total_visiteurs.columns = ['nombre de clients dans resort', 'nombre de clients dans city hotel']

# Reset the index so that 'month' becomes a column again
total_visiteurs.reset_index(level=0, inplace=True)
total_visiteurs.rename(columns={'index':'month'}, inplace=True)

total_visiteurs
Out[23]:
month nombre de clients dans resort nombre de clients dans city hotel
0 April 2550 4015
1 August 3257 5381
2 December 2017 2392
3 February 2308 3064
4 January 1868 2254
5 July 3137 4782
6 June 2038 4366
7 March 2573 4072
8 May 2535 4579
9 November 1976 2696
10 October 2577 4337
11 September 2102 4290

We sort the total_visiteurs DataFrame by the 'month' column, in calendar order.

In [24]:
# Define a custom order for the months of the year
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Convert the 'month' column to a categorical with that custom order
total_visiteurs['month'] = pd.Categorical(total_visiteurs['month'], categories=months_order, ordered=True)

# Sort the DataFrame by month order
total_visiteurs = total_visiteurs.sort_values('month')

total_visiteurs
Out[24]:
month nombre de clients dans resort nombre de clients dans city hotel
4 January 1868 2254
3 February 2308 3064
7 March 2573 4072
0 April 2550 4015
8 May 2535 4579
6 June 2038 4366
5 July 3137 4782
1 August 3257 5381
11 September 2102 4290
10 October 2577 4337
9 November 1976 2696
2 December 2017 2392
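
The month-by-hotel table built in the cells above can also be obtained in one step with pd.crosstab, followed by a reindex on the calendar order. This is a minimal sketch on toy rows; the variable names kept and monthly are assumptions for illustration.

```python
import pandas as pd

months_order = ['January', 'February', 'March', 'April', 'May', 'June',
                'July', 'August', 'September', 'October', 'November', 'December']

# Toy bookings (values are made up); the last one is cancelled
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "arrival_date_month": ["July", "July", "January", "August", "July"],
    "is_canceled": [0, 0, 0, 0, 1],
})

# Keep the non-cancelled bookings, then count them per month and per hotel
kept = toy[toy["is_canceled"] == 0]
monthly = (pd.crosstab(kept["arrival_date_month"], kept["hotel"])
             .reindex(index=months_order, fill_value=0))  # calendar order, 0 for absent months
print(monthly)
```

The reindex step plays the role of the categorical sort: rows come out from January to December, and months without any booking appear with a count of 0.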
In [25]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))

sns.lineplot(x='month', y='nombre de clients dans resort', data=total_visiteurs, label='nombre de clients dans resort')
sns.lineplot(x='month', y='nombre de clients dans city hotel', data=total_visiteurs, label='nombre de clients dans city hotel ')

plt.title('Total nombre de clients par mois')
plt.xticks(rotation=45)
plt.legend(loc='upper right')

plt.show()

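According to the table above, both hotels receive the most guests in July and August (4,782 and 5,381 at the City Hotel; 3,137 and 3,257 at the Resort Hotel). The City Hotel also stays busy from March to June and in September and October, with more than 4,000 guests in each of those months. The Resort Hotel dips in June (2,038) and September (2,102), on either side of its summer peak. Both hotels receive the fewest guests in winter, from November to January. These figures are raw totals over all arrival years, and monthly room prices are not computed here, so no link between occupancy and price can be read from this plot.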
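
The monthly figures above are totals over every arrival year in the data. If some months are covered by more years than others (an assumption worth checking with a year-by-month count), an average per observed year makes the months easier to compare. The sketch below shows the idea on toy rows.

```python
import pandas as pd

# Toy arrivals: July is observed in three years, March in two (values are made up)
toy = pd.DataFrame({
    "arrival_date_year":  [2015, 2016, 2016, 2017, 2016, 2017],
    "arrival_date_month": ["July", "July", "July", "July", "March", "March"],
})

# Number of bookings for each (month, year) pair that occurs in the data
per_year = toy.groupby(["arrival_date_month", "arrival_date_year"]).size()

# Raw total per month, and average per year in which the month is observed
total = per_year.groupby(level="arrival_date_month").sum()
average = per_year.groupby(level="arrival_date_month").mean()
print(pd.DataFrame({"total": total, "average_per_year": average}))
```

In this toy case July has twice as many bookings as March in total, but only about a third more per observed year.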

In [26]:
data_3 = data.query("is_canceled == 0")
data_3.head()
Out[26]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit 0.0 0.0 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit 0.0 0.0 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit 0.0 0.0 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 0.0 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 0.0 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

We create a new DataFrame, data_3, that contains only the rows of the original DataFrame data where the 'is_canceled' column equals 0. In other words, data_3 holds only the bookings that were not cancelled.

In [27]:
data_3 = data_3.assign(total_nights = data_3['stays_in_weekend_nights'] + data_3['stays_in_week_nights'])
data_3.head()
Out[27]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date total_nights
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... 0.0 0.0 0 Transient 0.0 0 0 Check-Out 2015-07-01 0
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... 0.0 0.0 0 Transient 0.0 0 0 Check-Out 2015-07-01 0
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... 0.0 0.0 0 Transient 75.0 0 0 Check-Out 2015-07-02 1
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... 304.0 0.0 0 Transient 75.0 0 0 Check-Out 2015-07-02 1
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... 240.0 0.0 0 Transient 98.0 0 1 Check-Out 2015-07-03 2

5 rows × 33 columns

In [28]:
duree = (data_3
        .groupby(['total_nights', 'hotel'])
        .size()
        .reset_index(name='duree'))
duree
Out[28]:
total_nights hotel duree
0 0 City Hotel 308
1 0 Resort Hotel 372
2 1 City Hotel 9169
3 1 Resort Hotel 6580
4 2 City Hotel 10992
... ... ... ...
63 49 City Hotel 1
64 56 Resort Hotel 1
65 57 City Hotel 1
66 60 Resort Hotel 1
67 69 Resort Hotel 1

68 rows × 3 columns
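
Because the two hotels do not have the same number of bookings, raw counts per length of stay are hard to compare. One option, sketched here on toy data, is to express each length of stay as a share of the hotel's own bookings with the normalize argument of pd.crosstab.

```python
import pandas as pd

# Toy stays (values are made up)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "total_nights": [1, 1, 2, 3, 1, 7],
})

# Share of each length of stay within each hotel (every column sums to 1)
share = pd.crosstab(toy["total_nights"], toy["hotel"], normalize="columns")
print(share)
```

On the real data the same call on data_3['total_nights'] and data_3['hotel'] would give shares that can be plotted side by side.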

In [29]:
import matplotlib.pyplot as plt

duree_pivot = duree.pivot(index='total_nights', columns='hotel', values='duree')
duree_pivot.plot(kind='bar', stacked=False)

plt.xlabel('Total Nights')
plt.ylabel('duree')
plt.title('Total Nights vs duree pour chaque hotel')
plt.show()
In [30]:
import matplotlib.pyplot as plt
import numpy as np

# Create the figure that matshow will draw into
fig = plt.figure(figsize = (24, 12))

# Compute the correlation matrix on the numeric columns only
corr = data.select_dtypes(include='number').corr()

# Use matshow to draw the matrix as a colour map in the figure created above
# (without fignum, matshow opens a new figure and leaves this one empty)
plt.matshow(corr, cmap='RdBu', fignum=fig.number)

# Add a colour bar
plt.colorbar()

# Add a tick for each column and rotate the x labels for readability
plt.xticks(np.arange(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(np.arange(len(corr.columns)), corr.columns)

plt.show()
In [31]:
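# Absolute correlation of each numeric column with the target, strongest first
correlation = data.select_dtypes(include='number').corr()['is_canceled'].abs().sort_values(ascending = False)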
correlation
Out[31]:
is_canceled                       1.000000
lead_time                         0.293123
total_of_special_requests         0.234658
required_car_parking_spaces       0.195498
booking_changes                   0.144381
previous_cancellations            0.110133
is_repeated_guest                 0.084793
company                           0.082995
adults                            0.060017
previous_bookings_not_canceled    0.057358
days_in_waiting_list              0.054186
adr                               0.047557
agent                             0.046529
babies                            0.032491
stays_in_week_nights              0.024765
arrival_date_year                 0.016660
arrival_date_week_number          0.008148
arrival_date_day_of_month         0.006130
children                          0.005036
stays_in_weekend_nights           0.001791
Name: is_canceled, dtype: float64

We draw a correlation heatmap to see how the columns correlate with one another. This lets us drop the columns that show little correlation with the target variable is_canceled and that we consider unnecessary.
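
The selection of weakly correlated columns can also be made explicit with a threshold on the absolute correlation. The sketch below uses a toy DataFrame and a threshold of 0.1; both are assumptions for illustration, and the notebook's own list of dropped columns was chosen by hand. Note that Pearson correlation only captures linear relationships, so a low value does not prove that a column is useless to a model.

```python
import pandas as pd

# Toy data (values are made up); 'hotel' is non-numeric and is ignored by the selection
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time":   [5, 10, 200, 150, 20, 300],
    "babies":      [0, 1, 0, 1, 0, 0],
    "hotel":       ["City", "Resort", "City", "City", "Resort", "City"],
})

# Absolute correlation of each numeric feature with the target
corr_with_target = (toy.select_dtypes(include="number")
                       .corr()["is_canceled"]
                       .drop("is_canceled")
                       .abs()
                       .sort_values(ascending=False))

threshold = 0.1  # hypothetical cut-off
weak_columns = corr_with_target[corr_with_target < threshold].index.tolist()
print(corr_with_target)
print("Candidates to drop:", weak_columns)
```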

In [32]:
# Columns to drop (each name listed once)
colonnes_inutiles = ['days_in_waiting_list', 'arrival_date_year', 'assigned_room_type', 'booking_changes',
               'reservation_status', 'country']

data.drop(colonnes_inutiles, axis = 1, inplace = True)
In [33]:
cat_cols = [col for col in data.columns if data[col].dtype == 'O']
cat_cols
Out[33]:
['hotel',
 'arrival_date_month',
 'meal',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'deposit_type',
 'customer_type',
 'reservation_status_date']
In [34]:
# Work on an explicit copy so that the assignments below do not target a view of data
cat_df = data[cat_cols].copy()
cat_df['reservation_status_date'] = pd.to_datetime(cat_df['reservation_status_date'])

cat_df['year'] = cat_df['reservation_status_date'].dt.year
cat_df['month'] = cat_df['reservation_status_date'].dt.month
cat_df['day'] = cat_df['reservation_status_date'].dt.day
In [35]:
cat_df.drop(['reservation_status_date','arrival_date_month'] , axis = 1, inplace = True)
In [36]:
cat_df['hotel'] = cat_df['hotel'].map({'Resort Hotel' : 0, 'City Hotel' : 1})

cat_df['meal'] = cat_df['meal'].map({'BB' : 0, 'FB': 1, 'HB': 2, 'SC': 3, 'Undefined': 4})

cat_df['market_segment'] = cat_df['market_segment'].map({'Direct': 0, 'Corporate': 1, 'Online TA': 2, 'Offline TA/TO': 3,
                                                           'Complementary': 4, 'Groups': 5, 'Undefined': 6, 'Aviation': 7})

cat_df['distribution_channel'] = cat_df['distribution_channel'].map({'Direct': 0, 'Corporate': 1, 'TA/TO': 2, 'Undefined': 3,
                                                                       'GDS': 4})

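# Room type 'P' (12 rows) has no entry in this mapping, so those rows become NaN
# and the column turns into float; the NaN values are filled further down.
cat_df['reserved_room_type'] = cat_df['reserved_room_type'].map({'C': 0, 'A': 1, 'D': 2, 'E': 3, 'G': 4, 'F': 5, 'H': 6,
                                                                   'L': 7, 'B': 8})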

cat_df['deposit_type'] = cat_df['deposit_type'].map({'No Deposit': 0, 'Refundable': 1, 'Non Refund': 3})

cat_df['customer_type'] = cat_df['customer_type'].map({'Transient': 0, 'Contract': 1, 'Transient-Party': 2, 'Group': 3})

cat_df['year'] = cat_df['year'].map({2015: 0, 2014: 1, 2016: 2, 2017: 3})
In [37]:
cat_df.head()
Out[37]:
hotel meal market_segment distribution_channel reserved_room_type deposit_type customer_type year month day
0 0 0 0 0 0.0 0 0 0 7 1
1 0 0 0 0 0.0 0 0 0 7 1
2 0 0 0 0 1.0 0 0 0 7 2
3 0 0 1 1 1.0 0 0 0 7 2
4 0 0 2 2 1.0 0 0 0 7 3
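
When a category has no entry in a hand-written mapping, Series.map returns NaN for it; this is why reserved_room_type shows float values such as 0.0 and 1.0 above. The sketch below, on a toy Series, shows one way to list the unmapped categories, and an alternative that derives the codes from the observed values with pd.factorize. The toy mapping is an assumption for illustration.

```python
import pandas as pd

# Toy room types (values are made up); 'P' is deliberately absent from the mapping
room_types = pd.Series(["A", "D", "P", "A", "L"], name="reserved_room_type")
mapping = {"A": 1, "D": 2, "L": 7}

# Categories without an entry in the mapping come back as NaN
encoded = room_types.map(mapping)
unmapped = sorted(room_types[encoded.isna()].unique())
print("Unmapped categories:", unmapped)

# Alternative: let pandas assign one integer code per observed category
codes, categories = pd.factorize(room_types, sort=True)
print(list(codes), list(categories))
```

With factorize no category can be forgotten, at the cost of codes that depend on the values present in the data.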
In [38]:
num_df = data.drop(columns = cat_cols, axis = 1)
num_df.drop('is_canceled', axis = 1, inplace = True)
num_df
Out[38]:
lead_time arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled agent company adr required_car_parking_spaces total_of_special_requests
0 342 27 1 0 0 2 0.0 0 0 0 0 0.0 0.0 0.00 0 0
1 737 27 1 0 0 2 0.0 0 0 0 0 0.0 0.0 0.00 0 0
2 7 27 1 0 1 1 0.0 0 0 0 0 0.0 0.0 75.00 0 0
3 13 27 1 0 1 1 0.0 0 0 0 0 304.0 0.0 75.00 0 0
4 14 27 1 0 2 2 0.0 0 0 0 0 240.0 0.0 98.00 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 23 35 30 2 5 2 0.0 0 0 0 0 394.0 0.0 96.14 0 0
119386 102 35 31 2 5 3 0.0 0 0 0 0 9.0 0.0 225.43 0 2
119387 34 35 31 2 5 2 0.0 0 0 0 0 9.0 0.0 157.71 0 4
119388 109 35 31 2 5 2 0.0 0 0 0 0 89.0 0.0 104.40 0 0
119389 205 35 29 2 7 2 0.0 0 0 0 0 9.0 0.0 151.20 0 2

119390 rows × 16 columns

In [39]:
num_df['adr'] = num_df['adr'].fillna(value = num_df['adr'].mean())
In [40]:
X = pd.concat([cat_df, num_df], axis = 1)
y = data['is_canceled']
In [41]:
missing_data_X = X.isnull().sum()
print(missing_data_X)
hotel                              0
meal                               0
market_segment                     0
distribution_channel               0
reserved_room_type                12
deposit_type                       0
customer_type                      0
year                               0
month                              0
day                                0
lead_time                          0
arrival_date_week_number           0
arrival_date_day_of_month          0
stays_in_weekend_nights            0
stays_in_week_nights               0
adults                             0
children                           0
babies                             0
is_repeated_guest                  0
previous_cancellations             0
previous_bookings_not_canceled     0
agent                              0
company                            0
adr                                0
required_car_parking_spaces        0
total_of_special_requests          0
dtype: int64
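
The 12 missing values in reserved_room_type only appear after the encoding step. This suggests (an assumption, since the raw value counts are not shown here) that some room-type codes are absent from the mapping dictionary: `Series.map` returns NaN for any value that is not among the dictionary keys. A minimal sketch on toy data, where the code 'P' and the helper name `unmapped_values` are hypothetical:

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the non-null values of `series` that have no key in `mapping`."""
    return set(series.dropna().unique()) - set(mapping)

# Toy data: 'P' is a hypothetical code that the mapping does not cover.
rooms = pd.Series(['A', 'C', 'P', 'D', 'P'])
room_map = {'C': 0, 'A': 1, 'D': 2}

missing_keys = unmapped_values(rooms, room_map)
encoded = rooms.map(room_map)  # unmapped values become NaN, so the dtype is float

print(missing_keys)
print(encoded.isnull().sum())
```

Running such a check before calling `map` makes it possible to extend the dictionary instead of imputing the gaps afterwards.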
In [42]:
missing_data_y = y.isnull().sum()
print(missing_data_y)
0
In [43]:
# Handle missing and infinite values in X
# Replace infinite values with NaN
X = X.replace([np.inf, -np.inf], np.nan)
# Replace missing values with the mean of each column
X = X.fillna(X.mean())
In [44]:
missing_data_X = X.isnull().sum()
print(missing_data_X)
hotel                             0
meal                              0
market_segment                    0
distribution_channel              0
reserved_room_type                0
deposit_type                      0
customer_type                     0
year                              0
month                             0
day                               0
lead_time                         0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
agent                             0
company                           0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
dtype: int64
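
Filling every column with its mean also affects label-encoded columns such as reserved_room_type, where the mean of the category codes is a fractional value that matches no category. One alternative, shown here as a sketch on toy data and not as the notebook's method, is to fill encoded categorical columns with their most frequent code and the numeric columns with their mean. The helper name `impute_mixed` and the toy frame are assumptions.

```python
import numpy as np
import pandas as pd

def impute_mixed(df: pd.DataFrame, categorical: list) -> pd.DataFrame:
    """Fill categorical code columns with the mode, all other columns with the mean."""
    out = df.replace([np.inf, -np.inf], np.nan)
    for col in out.columns:
        if col in categorical:
            fill = out[col].mode().iloc[0]  # most frequent code
        else:
            fill = out[col].mean()
        out[col] = out[col].fillna(fill)
    return out

# Toy frame with one gap in an encoded column and one in a numeric column
toy = pd.DataFrame({
    'reserved_room_type': [1.0, 1.0, 2.0, np.nan],
    'adr': [75.0, np.nan, 98.0, 127.0],
})
filled = impute_mixed(toy, categorical=['reserved_room_type'])
print(filled)
```

The sketch assumes every column has at least one non-missing value; an entirely empty column would need separate handling.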

Prediction:

Logistic regression

In [45]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,test_size=0.2)
In [46]:
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
reg.fit(X_train, y_train)
Out[46]:
LogisticRegression()
In [47]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred_reg = reg.predict(X_test)
acc_reg = accuracy_score(y_test, y_pred_reg)
conf = confusion_matrix(y_test, y_pred_reg)
clf_report = classification_report(y_test, y_pred_reg)
train_score = reg.score(X_train, y_train) * 100

print("Classification Report:")
print(clf_report)
print("\nConfusion Matrix:")
print(conf)
print("\nTraining Score: {:.2f}%".format(train_score))
print("Accuracy Score of Logistic Regression: {:.2f}%".format(acc_reg * 100))
Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.95      0.85     14934
           1       0.85      0.53      0.66      8944

    accuracy                           0.79     23878
   macro avg       0.81      0.74      0.75     23878
weighted avg       0.80      0.79      0.78     23878


Confusion Matrix:
[[14127   807]
 [ 4191  4753]]

Training Score: 79.08%
Accuracy Score of Logistic Regression: 79.07%
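
The figures in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a worked check for class 1 (cancelled bookings), the sketch below recomputes them from the matrix printed above; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(conf):
    """Precision, recall and F1 for class 1, plus overall accuracy, from a 2x2 matrix.

    conf[i][j] counts the samples of true class i that were predicted as class j.
    """
    (tn, fp), (fn, tp) = conf
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Confusion matrix of the logistic regression on the test set
precision, recall, f1, accuracy = binary_metrics([[14127, 807], [4191, 4753]])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

A recall of 0.53 means the model misses 4191 of the 8944 actual cancellations, which the overall accuracy of 79% does not reveal on its own.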

Random forest

In [48]:
from sklearn.ensemble import RandomForestClassifier

# Random forest with default hyperparameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
Out[48]:
RandomForestClassifier()
In [49]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_pred_rfc = rfc.predict(X_test)
conf = confusion_matrix(y_test, y_pred_rfc)
clf = classification_report(y_test, y_pred_rfc)
score = accuracy_score(y_test, y_pred_rfc)

print("Confusion Matrix:\n", conf)
print("\nClassification Report:\n", clf)
print("Accuracy Score:", score)
Confusion Matrix:
 [[14812   122]
 [  926  8018]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     14934
           1       0.99      0.90      0.94      8944

    accuracy                           0.96     23878
   macro avg       0.96      0.94      0.95     23878
weighted avg       0.96      0.96      0.96     23878

Accuracy Score: 0.9561102269871848
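
A single train/test split gives only one estimate of accuracy. A common way to check how stable such a score is consists of k-fold cross-validation, which reports one accuracy per fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's X and y (an assumption for illustration), so its numbers say nothing about the booking data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's X and y (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
# 5-fold cross-validation: the model is refit on 4 folds and scored on the 5th
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')

print("fold accuracies:", scores.round(3))
print("mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`; a small standard deviation across folds would support the single-split result.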

Decision tree

In [50]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=6, random_state=123,criterion='entropy')

dtree.fit(X_train,y_train)
Out[50]:
DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=123)
In [51]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_pred_dtree = dtree.predict(X_test)
conf = confusion_matrix(y_test, y_pred_dtree)
clf = classification_report(y_test, y_pred_dtree)
score = accuracy_score(y_test, y_pred_dtree)

print("Confusion Matrix:\n", conf)
print("\nClassification Report:\n", clf)
print("Accuracy Score:", score)
Confusion Matrix:
 [[13118  1816]
 [ 3254  5690]]

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.88      0.84     14934
           1       0.76      0.64      0.69      8944

    accuracy                           0.79     23878
   macro avg       0.78      0.76      0.76     23878
weighted avg       0.79      0.79      0.78     23878

Accuracy Score: 0.7876706591841863
In [52]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the accuracy score of each method
acc_reg = accuracy_score(y_test, y_pred_reg)
acc_rfc = accuracy_score(y_test, y_pred_rfc)
acc_dtc = accuracy_score(y_test, y_pred_dtree)

# Store the methods and their accuracy scores in a DataFrame
# (named `scores` so that the original `data` DataFrame is not overwritten)
scores = {'Method': ['Logistic Regression', 'Random Forest Classifier', 'Decision Tree Classifier'],
          'Accuracy Score': [acc_reg, acc_rfc, acc_dtc]}
scores_df = pd.DataFrame(scores)

# Sort the methods by decreasing accuracy score
scores_df = scores_df.sort_values(by='Accuracy Score', ascending=False)

# Draw a horizontal bar chart with Seaborn
plt.figure(figsize=(10, 6))
sns.barplot(x='Accuracy Score', y='Method', data=scores_df, palette='viridis')
plt.xlabel('Accuracy Score')
plt.ylabel('Method')
plt.title('Comparison of Accuracy Scores')
plt.show()

This chart shows that the random forest classifier is the most effective of the three models, with a test accuracy of about 95.6%, against roughly 79% for both logistic regression and the decision tree.
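
Accuracy alone can hide differences that matter for cancellations, since about 63% of the test bookings (14934 of 23878) are not cancelled. As a complement, the sketch below recomputes accuracy and recall on the cancelled class from the three confusion matrices printed above, next to a majority-class baseline. The variable names are illustrative.

```python
# Test-set confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class)
matrices = {
    'Logistic Regression': [[14127, 807], [4191, 4753]],
    'Random Forest Classifier': [[14812, 122], [926, 8018]],
    'Decision Tree Classifier': [[13118, 1816], [3254, 5690]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    summary[name] = {
        'accuracy': (tn + tp) / total,
        'recall_cancelled': tp / (fn + tp),
    }

# Baseline: always predict "not cancelled" (class 0)
baseline_accuracy = 14934 / 23878

for name, m in summary.items():
    print(f"{name:<26} accuracy={m['accuracy']:.3f} recall={m['recall_cancelled']:.3f}")
print(f"{'Majority baseline':<26} accuracy={baseline_accuracy:.3f}")
```

Logistic regression and the decision tree have nearly the same accuracy, but the decision tree detects more of the actual cancellations (recall 0.64 against 0.53), and the random forest leads on both measures.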